Abstract

We study the stochastic multi-armed bandit problem with non-equivalent multiple
plays where, at each step, an agent chooses not only a set of arms, but also their
order, which influences the reward distribution. In several problem formulations with
different assumptions, we provide lower bounds for the regret with the standard
asymptotics $O(\log t)$ but novel coefficients, and provide optimal algorithms, thus
proving that these bounds cannot be improved.

1 Introduction 

Multi-armed bandit (MAB) is a common model for formulating problems of finding the tradeoff
between exploration and exploitation. Its stochastic formulation with multiple plays was
originally considered in [2]. In this formulation, at each step of a game, an agent chooses $m$
arms from an arm set $A$ and observes the reward for each of them, which is a random variable
whose distribution is a property of the arm. The agent's goal is to minimize the expected
cumulative regret over the first $T$ steps, i.e., the difference between the expected cumulative
reward of the observed arms for the optimal strategy, which relies on the complete information
about the reward distributions of all the arms, and for the chosen strategy, which relies on the
past observations only. In [2], a theoretical analysis of the asymptotic behavior of the
cumulative regret is provided.

An important limitation of [2] is that the rewards of the chosen arms are supposed to be
independent of the order in which the agent puts them into the set. In many applications, on the
contrary, the same arm can exhibit different reward distributions at different positions. In
particular, problems of web search ranking [?], recommendations [?], and contextual advertising
[?] are often formulated as MAB problems with documents, recommended items, and ads respectively
as arms. Steps of the game correspond to the requests of users, the application (agent) chooses
objects to show them in different slots (positions) of the web page, and the user's interaction
with an object (which defines its reward) clearly depends on the slot of the page the object is
placed in.

Some papers studied adversarial bandit settings with non-equivalent plays [?]. Some other
studies [?] consider stochastic problem formulations and prove upper bounds for the regret of
the corresponding algorithms. All these algorithms follow a general scheme: they rank arms by
some score which balances between exploration and exploitation, and choose the top arms for the
slots in the order of the slots' importance. Thereby, these algorithms use the same exploration
rate to choose arms for different positions. However, it follows from [2] that, in the stochastic
setting, even in the case of equivalent plays, an asymptotically optimal algorithm should explore
only one arm per step most of the time.

In this paper, we consider several settings of the general stochastic non-contextual MAB problem
with non-equivalent multiple plays. These settings (see Section 2 for their description) differ
by additional restrictions on the parameter space of arms and the reward distributions of their
lists. These assumptions were adopted in many of the above-mentioned works and cover a variety of
applied tasks. In the chosen settings, we provide lower bounds for the asymptotic behavior of the
cumulative regret in Section 3 and prove their tightness under additional reasonable requirements
by presenting an algorithm with the same asymptotic regret behavior. Importantly, the form of
each lower bound gives an insight into the construction of optimal algorithms in some specific
cases not covered by our algorithm.

2 Problem Formalization 

Let us consider the following problem. There is a parameter space $A$ equipped with continuously
distributed random vectors $F(\bar a)$ with densities $f(\cdot, \bar a)$ and finite expectations
$\mu(\bar a)$ for each list $\bar a \in A^m$ of values of a fixed length $m$. We require each
component $F_i$ of $F(\bar a)$ to be integrable: $\int |x_i| f(x, \bar a)\,dx < \infty$. The case
where all distributions $f(\cdot, \bar a)$ are discrete can be considered as well by substituting
probability functions for the densities $f(\cdot, \bar a)$ and summation for integration
everywhere in the paper.

At the start, an agent is provided with the space $A$ and arms $1, 2, \ldots, N$, where each arm
$j$ is provided with an unknown parameter $a_j \in A$. We denote
$\mathcal{A} := (a_1, \ldots, a_N)$. At each step $t$, the agent chooses a list of different arms
$\pi_t = (\pi_t(1), \ldots, \pi_t(m))$ ($\pi_t(j_1) \neq \pi_t(j_2)$ if $j_1 \neq j_2$) to fill
a row of slots $S = \{1, \ldots, m\}$ with them. We denote the set of all the lists of $m$
different arms by $\Pi$. Next, the agent observes a realization $F_t$ of $F(\bar a_{\pi_t})$
($F_t$ are independent over steps), where
$\bar a_{\pi_t} = (a_{\pi_t(1)}, \ldots, a_{\pi_t(m)})$, and further utilizes it for choosing
lists at future steps. Note that $f(\cdot, \bar a)$ may be not invariant with respect to
permutations, i.e., the order of the arms in the list is important. The agent's goal is to
minimize the cumulative regret $\mathrm{Reg}_T$ over the first $T$ steps:

$$\mathrm{Reg}_T = T \max_{\pi \in \Pi} \mathbb{E} R(\bar a_\pi) - \mathbb{E} \sum_{t=1}^{T} R_t,$$

where $R(\bar a) = U(F(\bar a))$, $R_t = U(F_t)$, and $U$ is the real-valued reward function of
the observed values. Splitting the standard notion of an observed reward into the observed values
$F$ and the reward $R$ allows us to handle the case of observing only the list reward (see [?])
as well as the cases when the agent observes the contribution of each individual arm to the list
reward (see, e.g., Assumption 2) or other aspects of the interaction that can provide additional
information on $a_j$, e.g., the time to the first click or the session duration in the case of
web services. The described problem setting generalizes the one considered in [2] to
non-equivalent plays and to a more general form of relation between the observed values $F$ and
the optimized reward $R$.
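
To make the notion of regret of an ordered list concrete, the following minimal Python sketch enumerates all ordered lists of $m$ distinct arms and computes $\max_{\pi'} \mathbb{E}R(\bar a_{\pi'}) - \mathbb{E}R(\bar a_\pi)$ for each of them. The additive reward with slot weights (`slot_weight`, `arm_mean`) is a hypothetical toy model chosen only for illustration; it is not prescribed by the general setting above.

```python
from itertools import permutations


def list_regrets(mean_reward, n_arms, m):
    """Return {list: regret} for all ordered lists of m distinct arms.

    mean_reward(pi) must return the expected reward of the ordered list pi.
    """
    values = {pi: mean_reward(pi) for pi in permutations(range(n_arms), m)}
    best = max(values.values())
    return {pi: best - v for pi, v in values.items()}


# Hypothetical example: additive reward, slot k scales the mean of the arm in it.
slot_weight = [1.0, 0.5]
arm_mean = [0.9, 0.6, 0.3]


def additive_mean(pi):
    return sum(slot_weight[k] * arm_mean[j] for k, j in enumerate(pi))


regrets = list_regrets(additive_mean, n_arms=3, m=2)
```

In this toy model the same two arms give different regrets depending on their order: the list `(0, 1)` is optimal, while `(1, 0)` is not.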

3 Lower Bounds for Regret 

Before presenting each of our results, we introduce some notation and the additional assumptions
(on the space $(A, \{f(\cdot, \bar a)\}_{\bar a \in A^m})$) this result relies on. The
Kullback-Leibler divergence, $I(f(\cdot), g(\cdot)) = \int f(x) \log \frac{f(x)}{g(x)}\,dx$, is
a widely used measure of dissimilarity between two distributions. We denote
$I(\bar a, \bar b) = I(f(\cdot, \bar a), f(\cdot, \bar b))$ for brevity. We assume that our space
of distributions $\{f(\cdot, \bar a)\}_{\bar a \in A^m}$ satisfies the condition
$0 < I(\bar a, \bar b) < \infty$ for any different $\bar a, \bar b \in A^m$. Following [2], we
consider only uniformly good strategies, i.e., the ones with the cumulative regret of order
$o(T^\alpha)$ for any $\alpha > 0$ and any $\mathcal{A} \in A^N$. Assume, WLOG, that each of arms
$1, \ldots, m, \ldots, n$ is included in at least one optimal list (one with the highest reward
expectation $\max_{\pi \in \Pi} \mathbb{E} R(\bar a_\pi)$) and each of arms $n+1, \ldots, N$ is
not. We call arms from these two groups relevant and irrelevant respectively. We denote
$\Pi_j := \{\pi \in \Pi : j \in \{\pi(k)\}_{k \in S}\}$,
$\bar a = (\bar a(1), \ldots, \bar a(m))$, and use $\bar a^{\{k \leftarrow a\}}$ for the list of
parameter values $\bar a$ with $a$ substituted into the position $k$.

Our first assumption is similar to (but weaker than) the combination of Equations 2.2 and 2.4
from [2].

Assumption 1. Denseness condition: for any list do G A"*, slot k, finite set of lists A C A"*, and 
p > 0, there exists Qq G A s.t. (i) Ei?(do) < Ei?(dg^^^°^), (ii) for any list d G A and slot k' s.t. 
A EF(do), we have I{d, <{1 + p)I{d, a{fc'^so(fe)}). 

Assumption 1 states that we can improve the performance of any list by substituting into an
arbitrary position a value which is arbitrarily “close” to the replaced value in terms of the
reward distributions of a set of lists. This assumption holds, e.g., if $A = \mathbb{R}$, the
function $I(\bar a, \bar b)$ is continuous and $\mathbb{E} R(\bar a)$ is strictly monotone with
respect to any $\bar a(k)$, $k \in S$. Denote by $N_T(\pi)$ the number of times list $\pi$ is
used up to step $T$. The following lemma provides a lower bound for the regret in an implicit
form and helps to obtain an explicit lower bound stated by Theorem 1 under an additional
assumption.


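Lemma 1. Under Assumption 1, for any uniformly good strategy and any $\mathcal{A} \in A^N$, for
any relevant arm $i \le n$ and any irrelevant arm $j$, the set of numbers
$\{\mathbb{E} N_T(\pi)\}_{\pi \in \Pi_j}$ satisfies the following inequality:

$$\liminf_{T \to \infty} \sum_{\pi \in \Pi_j} \frac{\mathbb{E} N_T(\pi)}{\log T}\, I\big(\bar a_\pi, \bar a_\pi^{\{\pi^{-1}(j) \leftarrow a_i\}}\big) \ge 1. \tag{1}$$

Consequently, there exists a function $x(T, \pi)$ on $\mathbb{N} \times \Pi$ such that, for all
$i \le n$, $j > n$, we have
$\liminf_{T \to \infty} \sum_{\pi \in \Pi_j} x(T, \pi)\, I\big(\bar a_\pi, \bar a_\pi^{\{\pi^{-1}(j) \leftarrow a_i\}}\big) \ge 1$
and the cumulative regret over $T$ steps satisfies

$$\liminf_{T \to \infty} \frac{\mathrm{Reg}_T}{\log T} \ge \liminf_{T \to \infty} \sum_{\pi \in \Pi} x(T, \pi)\, \mathrm{Reg}(\pi), \tag{2}$$

where $\mathrm{Reg}(\pi) = \max_{\pi' \in \Pi} \mathbb{E} R(\bar a_{\pi'}) - \mathbb{E} R(\bar a_\pi)$.
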


This result generalizes Theorem 3.1 from [2] in two respects: (i) the contribution of each arm to
the list reward $R(\bar a)$ may be not observed; (ii) the reward of the list depends on the order
of the arms. Note that Lemma 1 does not use any assumption on the relations between the reward
distributions $f(\cdot, \bar a)$ of different lists $\bar a$, e.g., overlapping in their values.
Intuitively, the fewer relations are encoded in the space
$(A, \{f(\cdot, \bar a)\}_{\bar a \in A^m})$, the higher the actual regret of the optimal
strategy is (we informally call such relations correlation). In fact, the bound from Lemma 1 will
be tight only in the case of “full information” (see, e.g., Theorem 3). One can give a formal
definition for the opposite case of no information (omitted due to lack of space), when the
observed rewards of one list of arms tell nothing about the reward distributions of the others.
In this case, our problem setting reduces to the standard stochastic MAB problem with single
plays by considering each list as a separate arm. For it, the tight lower bound for the regret is
provided in [?, Theorem 1]. Hence, we return to the setting with Assumption 1.

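An explicit bound on the regret can be found as the infimum of the right-hand side of Equation 2
over possible functions $x(T, \pi)$. We claim, omitting a rather standard proof, that there
exists $x(T, \pi)$ which provides the minimum and has a finite limit
$y(\pi) = \lim_{T \to \infty} x(T, \pi)$ for each $\pi \in \Pi$. To find the optimal values of
$y(\pi)$, we consider the Karush-Kuhn-Tucker conditions (with multipliers $\lambda_{ij} \ge 0$)
for the minimization of the right-hand side of Equation 2 under the constraints defined by
Equation 1 for all $i \le n$ and $j > n$ with
$\liminf_{T \to \infty} \mathbb{E} N_T(\pi) / \log T$ replaced by $y(\pi)$:

$$\sum_{\pi \in \Pi_j} y(\pi)\, I\big(\bar a_\pi, \bar a_\pi^{\{\pi^{-1}(j) \leftarrow a_i\}}\big) = 1 \ \text{ or } \ \lambda_{ij} = 0 \quad \text{for any } j > n,\ i \le n,$$

$$\sum_{k \in S:\ \pi(k) > n}\ \sum_{i \le n} \lambda_{i\pi(k)}\, I\big(\bar a_\pi, \bar a_\pi^{\{k \leftarrow a_i\}}\big) = \mathrm{Reg}(\pi) \ \text{ or } \ y(\pi) = 0 \quad \text{for any } \pi \in \Pi. \tag{3}$$

Thus, the optimal values of $y(\pi)$ could be found by comparing the solutions of all the linear
systems over different arms $i$, $j$ and lists $\pi$ satisfying $\lambda_{ij} = 0$ and
$y(\pi) = 0$ respectively.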

However, the minimum can be found more efficiently under the following assumption about the
decomposition of a list reward into the sum of the arms' rewards, which is almost always accepted
in the literature [?], because it is satisfied by different measures of profit for many
applications.

Assumption 2. Decomposition condition: (i) $R(\bar a) = \sum_{k \in S} F(k, \bar a(k))$, where
$\{F(k, \bar a(k))\}_{k \in S}$ are independent, (ii) the vector $F(\bar a)$ includes
$F(1, \bar a(1)), \ldots, F(m, \bar a(m))$ as its components.


For example, most online measures of web search quality accumulate some relevance gains over
documents, e.g., clicks or their dwell times. Observability of the values
$F(1, \bar a(1)), \ldots, F(m, \bar a(m))$ (condition (ii)) is crucial for our analysis, since it
allows us to aggregate information about the plays of an arm in a slot regardless of the arms
chosen for the other slots. We denote the distribution density of $F(k, a)$ by
$f(\cdot, k, a)$, introduce $I_k(a, b) := I(f(\cdot, k, a), f(\cdot, k, b))$ and
$\mathrm{Reg}(k, j) := \min_{\pi \in \Pi:\ \pi(k) = j} \mathrm{Reg}(\pi)$, and use $A^*_k$ for
the set of arms which are placed in slot $k$ in at least one optimal list. Assumption 2 allows us
to present the lower bound for the regret in the following simple form.
Theorem 1. Under Assumptions 1 and 2, for any uniformly good strategy and any
$\mathcal{A} \in A^N$,

$$\liminf_{T \to \infty} \frac{\mathrm{Reg}_T}{\log T} \ge \sum_{j > n} \max_{i \le n} \min_{k \in S} \frac{\mathrm{Reg}(k, j)}{I_k(a_j, a_i)}. \tag{4}$$



This result is very intuitive: the maximization means that we should distinguish an irrelevant
arm $j$ from any relevant arm $i$, the minimization reflects the hope that we are able to make
exploratory observations mostly in optimal slots, and the optimized ratio is standard. Note that
Theorem 1 improves only the representation of the lower bound given by Lemma 1 but not the bound
itself, which is impossible under Assumptions 1 and 2 (as we prove in Section 4).
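
As an illustration of how the coefficient $\sum_{j>n} \max_{i \le n} \min_{k \in S} \mathrm{Reg}(k,j)/I_k(a_j,a_i)$ of this lower bound can be evaluated, here is a minimal brute-force sketch in Python. It assumes a hypothetical toy model that is not part of the theorem: the reward of arm $j$ in slot $k$ is the product of an observed Bernoulli examination indicator with mean `slot_prob[k]` and a Bernoulli reward with mean `arm_mean[j]`, so that $I_k(a_j,a_i)$ equals `slot_prob[k]` times the Bernoulli divergence. All function and variable names are illustrative.

```python
import math
from itertools import permutations


def kl_bernoulli(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def theorem1_coefficient(slot_prob, arm_mean):
    """Brute-force sum over irrelevant j of max_i min_k Reg(k, j) / I_k(a_j, a_i)."""
    m, n_arms = len(slot_prob), len(arm_mean)
    lists = list(permutations(range(n_arms), m))

    def value(pi):
        return sum(slot_prob[k] * arm_mean[j] for k, j in enumerate(pi))

    best = max(value(pi) for pi in lists)
    optimal = [pi for pi in lists if math.isclose(value(pi), best)]
    relevant = {j for pi in optimal for j in pi}

    total = 0.0
    for j in set(range(n_arms)) - relevant:
        # Reg(k, j): smallest regret among the lists with arm j in slot k.
        reg = [min(best - value(pi) for pi in lists if pi[k] == j) for k in range(m)]
        total += max(
            min(reg[k] / (slot_prob[k] * kl_bernoulli(arm_mean[j], arm_mean[i]))
                for k in range(m))
            for i in relevant
        )
    return total


coefficient = theorem1_coefficient([1.0, 0.5], [0.9, 0.6, 0.3])
```

In this example the minimum over slots is attained in the last slot and the maximum over relevant arms at the weakest relevant arm, so the coefficient reduces to the gap in means divided by a single divergence.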

On the other hand, adding the requirement of uncorrelation between the reward distributions of an
arm in different slots allows us to obtain a higher lower bound in Theorem 2. The uncorrelation
combined with Assumption 1 under Assumption 2 is formalized in the following assumption.

Assumption 3. Uncorrelation-over-positions denseness condition: for any values $a, a_0 \in A$,
slot $k$ and $\rho > 0$, there exists $a_0' \in A$ s.t. (i)
$\mathbb{E} F(k, a_0) < \mathbb{E} F(k, a_0')$, (ii)
$I_k(a, a_0') < (1 + \rho) I_k(a, a_0)$, (iii) for any slot $k' \neq k$, we have
$f(\cdot, k', a) = f(\cdot, k', a_0')$.


This assumption holds, e.g., if $A = \mathbb{R}^m$, $F(k, a) = G(a^k)$ for
$a = (a^1, \ldots, a^m)$, where a space of distributions $\{G(\theta)\}_{\theta \in \mathbb{R}}$
is characterized by a strictly monotone (in $\theta$) expectation function and a continuous
function $I(\theta_1, \theta_2)$.

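Theorem 2. Under Assumptions 2 and 3, for any uniformly good strategy and any
$\mathcal{A} \in A^N$, the number of plays $N_T(k, j)$ of any irrelevant arm $j > n$ in any slot
$k$ during the first $T$ steps satisfies the following inequality for any arm $i \in A^*_k$ (the
set of arms placed in slot $k$ in at least one optimal list):

$$\liminf_{T \to \infty} \frac{\mathbb{E} N_T(k, j)}{\log T} \ge \frac{1}{I_k(a_j, a_i)}. \tag{5}$$

Then the cumulative regret satisfies the following lower bound:

$$\liminf_{T \to \infty} \frac{\mathrm{Reg}_T}{\log T} \ge \sum_{j > n,\ k \in S}\ \max_{i \in A^*_k} \frac{\mathrm{Reg}(k, j)}{I_k(a_j, a_i)}. \tag{6}$$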
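
The coefficient in Equation 6 can be evaluated by brute force on small instances. The following Python sketch does this for a hypothetical toy model (an assumption made only for illustration): arm $j$ shown in slot $k$ yields an independent Bernoulli reward with mean `mean[k][j]`, the list reward is the sum over slots, and the per-slot means are unrelated across slots.

```python
import math
from itertools import permutations


def kl_bernoulli(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def theorem2_coefficient(mean):
    """Sum over irrelevant j and slots k of max over i in A*_k of Reg(k, j) / I_k.

    mean[k][j] is the (hypothetical) Bernoulli mean of arm j in slot k.
    """
    m, n_arms = len(mean), len(mean[0])
    lists = list(permutations(range(n_arms), m))

    def value(pi):
        return sum(mean[k][j] for k, j in enumerate(pi))

    best = max(value(pi) for pi in lists)
    optimal = [pi for pi in lists if math.isclose(value(pi), best)]
    relevant = {j for pi in optimal for j in pi}

    total = 0.0
    for k in range(m):
        optimal_in_slot = {pi[k] for pi in optimal}  # arms optimal for slot k
        for j in set(range(n_arms)) - relevant:
            reg_kj = min(best - value(pi) for pi in lists if pi[k] == j)
            total += max(reg_kj / kl_bernoulli(mean[k][j], mean[k][i])
                         for i in optimal_in_slot)
    return total


coefficient = theorem2_coefficient([[0.9, 0.6, 0.3], [0.45, 0.3, 0.15]])
```

Unlike the bound of Theorem 1, every slot contributes its own term here, since observations of an arm in one slot carry no information about its behavior in another slot.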


Naturally, in order to distinguish the arm $j$ from the arm $i$ in the slot $k$, we need to play
it in this slot the same number of times as in the standard SMAB problem with one play and the
optimal arm $i$. Though the reward distributions of an object in different slots seem to be
dependent in practice, it may be useful for constructing a strategy to treat them as independent
if the dependence is difficult to infer. As an example, one can consider a project with various
tasks requiring different competencies and to be assigned to different workers from a big set of
candidates, e.g., a football match, where a manager chooses players for different positions. We
also note that the max in Equation 6 disappears if there is only one optimal list. Now we prove
our claims.

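Proof of Lemma 1. At the first step of our proof, we use the change of measure technique, like
[2] does, and prove that, for any irrelevant arm $j > n$, the vector of numbers
$\{N_T(\pi) / \log T\}_{\pi \in \Pi_j}$, with high probability, lies outside of some
$|\Pi_j|$-dimensional cuboids of the form
$\{\{x(\pi)\}_{\pi \in \Pi_j} : 0 \le x(\pi) \le c(\pi)\}$. At the second step, which is
completely novel and crucial for the new issues, we aggregate these estimates to show that this
vector is outside of a sequence of simplexes, which in the limit provides Equation 1.


Step 1. Consider any optimal arm list $\pi_0$, any arm $i \in \pi_0(S)$, and any irrelevant arm
$j > n$. According to Assumption 1 applied to the list $\bar a_{\pi_0}$, the slot
$\pi_0^{-1}(i)$ and the set of lists $\{\bar a_\pi\}_{\pi \in \Pi_j}$, for a fixed $\rho > 0$, we
can choose a value $a^* \in A$ such that

$$\text{(i)}\ \ \mathbb{E} R\big(\bar a_{\pi_0}^{\{\pi_0^{-1}(i) \leftarrow a^*\}}\big) > \mathbb{E} R(\bar a_{\pi_0}), \qquad \text{(ii)}\ \ (1 + \rho)\, I\big(\bar a_\pi, \bar a_\pi^{\{\pi^{-1}(j) \leftarrow a_i\}}\big) > I(\bar a_\pi, \bar a_\pi^*) \quad \forall \pi \in \Pi_j, \tag{7}$$

where we denote $\bar a_\pi^* := \bar a_\pi^{\{\pi^{-1}(j) \leftarrow a^*\}}$ for
$\pi \in \Pi_j$. We use the “alternative” parameter values
$\mathcal{A}^* = \{a_1, \ldots, a_{j-1}, a^*, a_{j+1}, \ldots, a_N\}$ to prove the following
statement.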


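Lemma 2. Consider any $c = \{c(\pi)\}_{\pi \in \Pi_j}$ satisfying
$\sum_{\pi \in \Pi_j} c(\pi)\, I(\bar a_\pi, \bar a_\pi^*) = \delta < \frac{1}{1 + \rho}$. We have

$$\lim_{T \to \infty} P_{\mathcal{A}}\Big\{\frac{N_T(\pi)}{\log T} \le c(\pi)\ \ \forall \pi \in \Pi_j\Big\} = 0. \tag{8}$$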
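The proof is based on the log odds ratio of the likelihood of the rewards
$R_1(\bar a_\pi), \ldots, R_t(\bar a_\pi)$ (observed at $t$ plays of a list $\pi$) under the
parameter values $\bar a_1$ and $\bar a_2$:
$L_{t,\pi}(\bar a_1, \bar a_2) = \sum_{\tau=1}^{t} \log \frac{f(R_\tau(\bar a_\pi), \bar a_1)}{f(R_\tau(\bar a_\pi), \bar a_2)}$.
By the strong law of large numbers, we have

$$I(\bar a_\pi, \bar a_\pi^*) = \mathbb{E}_{\mathcal{A}} \log \frac{f(R(\bar a_\pi), \bar a_\pi)}{f(R(\bar a_\pi), \bar a_\pi^*)} = \lim_{t \to \infty} \frac{L_{t,\pi}(\bar a_\pi, \bar a_\pi^*)}{t} = \lim_{t \to \infty} \frac{\max_{\tau \le t} L_{\tau,\pi}(\bar a_\pi, \bar a_\pi^*)}{t} \quad P_{\mathcal{A}}\text{-a.s.}$$

Consequently, for $\pi \in \Pi_j$ and the events
$K^{\pi}_{\tau,T} := \{L_{\tau,\pi}(\bar a_\pi, \bar a_\pi^*) \le (1 + \rho)\, I(\bar a_\pi, \bar a_\pi^*)\, c(\pi) \log T\}$,
we have

$$\lim_{T \to \infty} P_{\mathcal{A}}\Big(\bigcap_{\tau \le c(\pi) \log T} K^{\pi}_{\tau,T}\Big) = 1. \tag{9}$$

Using Equation 9, we obtain Lemma 2 in the following way:

$$\limsup_{T \to \infty} P_{\mathcal{A}}\Big(\bigcap_{\pi \in \Pi_j} \Big\{\frac{N_T(\pi)}{\log T} \le c(\pi)\Big\}\Big) \le \limsup_{T \to \infty} \Big[ P_{\mathcal{A}}\Big(\bigcap_{\pi \in \Pi_j} \big\{N_T(\pi) \le c(\pi) \log T,\ K^{\pi}_{N_T(\pi),T}\big\}\Big) + 1 - P_{\mathcal{A}}\Big(\bigcap_{\pi \in \Pi_j}\ \bigcap_{\tau \le c(\pi) \log T} K^{\pi}_{\tau,T}\Big) \Big] = 0. \tag{10}$$

To prove the last equality, we introduce, for any $\tau = \{\tau(\pi)\}_{\pi \in \Pi_j}$ s.t.
$\tau(\pi) \le c(\pi) \log T$, the event
$S_j(T, c, \tau) := \bigcap_{\pi \in \Pi_j} \big(\{N_T(\pi) = \tau(\pi)\} \cap K^{\pi}_{\tau(\pi),T}\big)$
and find:

$$P_{\mathcal{A}^*}(S_j(T, c, \tau)) = \int 1\{S_j(T, c, \tau)\}\, dP_{\mathcal{A}^*} = \int 1\{S_j(T, c, \tau)\}\, e^{-\sum_{\pi \in \Pi_j} L_{\tau(\pi),\pi}(\bar a_\pi, \bar a_\pi^*)}\, dP_{\mathcal{A}} \ge T^{-(1+\rho) \sum_{\pi \in \Pi_j} c(\pi) I(\bar a_\pi, \bar a_\pi^*)}\, P_{\mathcal{A}}(S_j(T, c, \tau)) = T^{-\delta(1+\rho)}\, P_{\mathcal{A}}(S_j(T, c, \tau)), \tag{11}$$

where the second equality uses the change of measure, which concerns only arm $j$, and the first
inequality is based on the definition of $K^{\pi}_{\tau(\pi),T}$. Since
$\bigcap_{\pi \in \Pi_j} \{N_T(\pi) \le c(\pi) \log T,\ K^{\pi}_{N_T(\pi),T}\} = \bigsqcup_{\tau:\ \tau(\pi) \le c(\pi) \log T} S_j(T, c, \tau)$,
where the united sets are disjoint, we obtain from Equation 11

$$P_{\mathcal{A}}\Big(\bigcap_{\pi \in \Pi_j} \big\{N_T(\pi) \le c(\pi) \log T,\ K^{\pi}_{N_T(\pi),T}\big\}\Big) \le T^{\delta(1+\rho)}\, P_{\mathcal{A}^*}\Big(\bigcap_{\pi \in \Pi_j} \big\{N_T(\pi) \le c(\pi) \log T,\ K^{\pi}_{N_T(\pi),T}\big\}\Big). \tag{12}$$

Equation 7 (i) implies that, under $\mathcal{A}^*$, any optimal list $\pi_{opt}$ belongs to
$\Pi_j$. Since the strategy is uniformly good, we also have
$P_{\mathcal{A}^*}\big\{\sum_{\pi \in \Pi_j} N_T(\pi) / \log T \le c\big\} \le \frac{\mathbb{E}_{\mathcal{A}^*}\big(T - \sum_{\pi \in \Pi_j} N_T(\pi)\big)}{T - c \log T} = o(T^{\alpha - 1})$
for any $c > 0$. Therefore, Equation 12 implies that its left-hand side is
$T^{\delta(1+\rho)}\, o(T^{\alpha - 1}) = o(1)$, if we choose
$\alpha \in (0, 1 - \delta(1 + \rho))$.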


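Step 2. Fix any $\varepsilon > 0$ and choose $\rho > 0$ such that
$\frac{1}{(1 + 2\rho)(1 + \rho)} > 1 - \varepsilon$. We obtain Equation 1 from

$$\sum_{\pi \in \Pi_j} \frac{\mathbb{E} N_T(\pi)}{\log T}\, I\big(\bar a_\pi, \bar a_\pi^{\{\pi^{-1}(j) \leftarrow a_i\}}\big) \ge P_{\mathcal{A}}\Big(\sum_{\pi \in \Pi_j} \frac{N_T(\pi)}{\log T}\, I\big(\bar a_\pi, \bar a_\pi^{\{\pi^{-1}(j) \leftarrow a_i\}}\big) > 1 - \varepsilon\Big)(1 - \varepsilon) \xrightarrow[T \to \infty]{} 1 - \varepsilon.$$

This convergence is equivalent to

$$\lim_{T \to \infty} P_{\mathcal{A}}\big(\{N_T(\pi) / \log T\}_{\pi \in \Pi_j} \in S_{1-\varepsilon}\big) = 0 \tag{13}$$

for the simplex
$S_{1-\varepsilon} = \big\{\{x(\pi)\}_{\pi \in \Pi_j} : \sum_{\pi \in \Pi_j} x(\pi)\, I\big(\bar a_\pi, \bar a_\pi^{\{\pi^{-1}(j) \leftarrow a_i\}}\big) \le 1 - \varepsilon,\ x(\pi) \ge 0\big\}$.
Note that it can be covered by a finite union of cuboids
$C_{\{c(\pi)\}_{\pi \in \Pi_j}} = \{\{x(\pi)\}_{\pi \in \Pi_j} : 0 \le x(\pi) \le c(\pi)\}$
contained in $S_{\frac{1}{(1+2\rho)(1+\rho)}}$, i.e., satisfying
$\sum_{\pi \in \Pi_j} c(\pi)\, I\big(\bar a_\pi, \bar a_\pi^{\{\pi^{-1}(j) \leftarrow a_i\}}\big) < \frac{1}{(1 + 2\rho)(1 + \rho)}$.
Due to Equation 7, the latter condition implies
$\sum_{\pi \in \Pi_j} c(\pi)\, I(\bar a_\pi, \bar a_\pi^*) < \frac{1}{1 + 2\rho}$. Then, from
Equation 8, for each of these cuboids,
$\lim_{T \to \infty} P_{\mathcal{A}}\big(\{N_T(\pi) / \log T\}_{\pi \in \Pi_j} \in C_{\{c(\pi)\}_{\pi \in \Pi_j}}\big) = 0$,
which implies Equation 13.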

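Finally, Equation 1.1 from [?] yields
$\liminf_{T \to \infty} \mathrm{Reg}_T / \log T = \liminf_{T \to \infty} \sum_{\pi \in \Pi} \mathbb{E} N_T(\pi)\, \mathrm{Reg}(\pi) / \log T$,
and we obtain Equation 2 by minimizing this expression over
$\{\mathbb{E} N_T(\pi)\}_{T \in \mathbb{N},\ \pi \in \Pi}$ satisfying Equation 1 for all
$i \le n$, $j > n$. The existence of an optimal function $x(T, \pi)$ is discussed above, before
Equation 3.

□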

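Proof of Theorem 1. Under Assumption 2, we have
$f((x^1, \ldots, x^m), \bar a) = f(x^1, 1, \bar a(1)) \cdots f(x^m, m, \bar a(m))$, and, thus,
for any list $\bar a \in A^m$, slot $k$ and value $a^* \in A$,

$$I\big(\bar a, \bar a^{\{k \leftarrow a^*\}}\big) = \int f(x^1, 1, \bar a(1)) \cdots f(x^m, m, \bar a(m)) \log \frac{f(x^k, k, \bar a(k))}{f(x^k, k, a^*)}\, dx^1 \ldots dx^m = I_k(\bar a(k), a^*). \tag{14}$$

Then, we can rewrite Equation 1 as follows:

$$\liminf_{T \to \infty} \sum_{k \in S} \frac{\mathbb{E} N_T(k, j)\, I_k(a_j, a_i)}{\log T} \ge 1. \tag{15}$$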
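
The reduction of the list-level divergence to a slot-level one (Equation 14) can be checked numerically on a small discrete example. The sketch below is only an illustration under an assumed model: each slot carries an independent Bernoulli observation, so the joint probability function of a list is a product of per-slot terms. The helper names are hypothetical.

```python
import math
from itertools import product


def joint_pmf(means):
    """Probability function of independent Bernoulli components, keyed by outcome."""
    return {
        x: math.prod(p if xi else 1 - p for xi, p in zip(x, means))
        for x in product((0, 1), repeat=len(means))
    }


def kl(f, g):
    """Kullback-Leibler divergence between two discrete distributions on the same support."""
    return sum(fx * math.log(fx / g[x]) for x, fx in f.items() if fx > 0)


def kl_bernoulli(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


original = [0.9, 0.6, 0.3]     # per-slot means of the list
substituted = [0.9, 0.2, 0.3]  # same list with another value in the middle slot

list_level = kl(joint_pmf(original), joint_pmf(substituted))
slot_level = kl_bernoulli(0.6, 0.2)
```

Here `list_level` and `slot_level` agree up to rounding, because all factors of the likelihood ratio except the one for the changed slot cancel.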

For further simplification of these restrictions, we utilize the following combinatorial lemma
which claims that, in order to minimize the regret under fixed values of
$\{N_T(k, j)\}_{j > n,\ k \in S}$, a strategy should not observe several arms $j > n$ at one
step.

Lemma 3. There are $m$ slots and $m$ objects with some reward $r(k, j)$ corresponding to an
object $j$ put in a slot $k$. Consider $t \le m$ steps and a subset of different slots
$\{k_1, \ldots, k_t\}$. Assume we should close each of these slots at exactly one step. After
that, at each step, we choose a combination of different objects to put in the open slots (only
one object in one slot) that maximizes the cumulative reward at this step. Then, one of the ways
to reach the maximum cumulative reward over all the steps is to close one slot per step.


Proof sketch. The idea of the proof is that, when closing just one slot at each step, we can
reproduce any combination $\{N_T(k, j)\}_{j > n,\ k \in S}$ which can be reached by any other
strategy. We omit the rigorous proof due to its technical nature. □

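Then, while considering only rational strategies from Lemma 3, each play of the arm $j$ in the
slot $k$ corresponds to a step with regret not less than $\mathrm{Reg}(k, j)$, which leads to
the following estimate, valid for any choice of $\{i_j\}_{j > n}$ with $i_j \le n$:

$$\liminf_{T \to \infty} \frac{\mathrm{Reg}_T}{\log T} \ge \liminf_{T \to \infty} \sum_{j > n,\ k \in S} \frac{\mathbb{E} N_T(k, j)\, \mathrm{Reg}(k, j)}{\log T} = \liminf_{T \to \infty} \sum_{j > n} \sum_{k \in S} \frac{\mathbb{E} N_T(k, j)\, I_k(a_j, a_{i_j})}{\log T} \cdot \frac{\mathrm{Reg}(k, j)}{I_k(a_j, a_{i_j})}$$

$$\ge \liminf_{T \to \infty} \sum_{j > n} \min_{k \in S} \frac{\mathrm{Reg}(k, j)}{I_k(a_j, a_{i_j})} \sum_{k \in S} \frac{\mathbb{E} N_T(k, j)\, I_k(a_j, a_{i_j})}{\log T} \ge \sum_{j > n} \min_{k \in S} \frac{\mathrm{Reg}(k, j)}{I_k(a_j, a_{i_j})},$$

where the last inequality follows from Equation 15. Taking the maximum over all possible sets
$\{i_j\}_{j > n}$ yields Equation 4. □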


Proof of Theorem 2. We describe a modification of Step 1 of the proof of Lemma 1 which proves
the current theorem. Given an irrelevant arm $j > n$ and an optimal list $\pi_0$ with an arm $i$
in a slot $k$, according to Assumption 3, we can choose a value $a^* \in A$ such that

$$\mathbb{E} F(k, a_i) < \mathbb{E} F(k, a^*), \qquad I_k(a_j, a^*) < (1 + \rho)\, I_k(a_j, a_i), \qquad \forall k' \neq k\ \ f(\cdot, k', a_j) = f(\cdot, k', a^*). \tag{16}$$

Then, in the case of the arm parameters
$\mathcal{A}^* = \{a_1, \ldots, a_{j-1}, a^*, a_{j+1}, \ldots, a_N\}$, the list $\pi_0$ is no
longer optimal (any optimal list contains the arm $j$ in the slot $k$) and, since the
probability to observe a fixed reward at some step in some slot differs under the measures
$P_{\mathcal{A}}$ and $P_{\mathcal{A}^*}$ only if the slot is $k$ with arm $j$ in it at this
step, we can estimate each value $N_T(k, j)$, $k \in S$, separately. Indeed, by choosing
$c(\pi) < \frac{1}{(1 + 2\rho)\, I_k(a_j, a^*)}$ for the lists with $\pi(k) = j$ and putting
$c(\pi) = +\infty$ for the other lists, we obtain estimates analogous to Equations 10-12
resulting in
$\lim_{T \to \infty} P_{\mathcal{A}}\Big(N_T(k, j) < \frac{\log T}{(1 + 2\rho)\, I_k(a_j, a^*)}\Big) = 0$
at the end of Step 1 of the proof of Lemma 1. Applying the second inequality from Equation 16
and letting $\rho \to 0$ yields Equation 5. Maximizing its right-hand side over $i \in A^*_k$
(the arms placed in slot $k$ in at least one optimal list) leads to Equation 6. □
4 Asymptotically Optimal Algorithm

In this section, we construct algorithms with asymptotically optimal regret reaching the lower
bounds from Lemma 1 and Theorems 1 and 2. In our construction, we rely on the algorithm proposed
in [2] under Assumptions 1 and 2 with the additional constraint
$f(\cdot, k, a) = f(\cdot, a)$ and modify it to handle the case of non-equivalent plays. First,
we supplement the setting of Theorem 1 with the following assumption, under which we are able to
present an optimal algorithm.

Assumption 4. Factorization condition: the arm reward has the form $F(k, a) = p(k) r(a)$, where
$p(k)$ is a Bernoulli random variable with a parameter dependent on $k$ only, and $r(a)$ is a
random variable with a distribution dependent on $a$ only. Besides, the values of $p(k)$,
$k \in S$, are components of $F(\bar a)$.

This factorization model is often used in different applied problems, e.g., it corresponds to
the examination hypothesis [?] underlying different models of user behavior on the web search
result page. Under this hypothesis, the variable $p(k)$ indicates whether the user examined the
document, and $r(a)$ measures the user satisfaction with it. One possible way to observe these
values is to consider a click or a click on a lower position as a fact of examination and a
satisfied click (e.g., one with a long enough dwell time or the last click in the session [?])
as a fact of user satisfaction. A more general but similar factorization assumption was also
considered in [?].

We denote $\mathbb{E} p(k) = p_k$, $\mathbb{E} r(a_j) = \mu_{a_j}$ and assume, WLOG, that
$\mu_{a_1} > \ldots > \mu_{a_l} > \mu_{a_{l+1}} = \ldots = \mu_{a_m} = \ldots = \mu_{a_n} > \mu_{a_{n+1}} \ge \ldots \ge \mu_{a_N}$
and the slots $1, \ldots, m$ are ordered by decreasing $p_k$. Below we assume that the agent
knows this order. Otherwise, it could sort the slots by the current empirical estimates of
$p_k$. Errors of these estimates do not influence the asymptotic behavior of the regret, due to
the exponential convergence rate of the mean estimate provided by the Chernoff-Hoeffding bound:
for i.i.d. random variables $x_1, \ldots, x_K$ with values in $[0, 1]$ and for any
$\varepsilon > 0$,
$P((x_1 + \ldots + x_K)/K - \mathbb{E} x_1 < -\varepsilon) \le e^{-2K\varepsilon^2}$.
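
The exponential decay promised by this bound can be compared with the exact tail for Bernoulli variables. The following Python sketch is only an illustration: it computes the exact lower-tail probability of the sample mean from the binomial distribution and the Hoeffding bound $e^{-2K\varepsilon^2}$; the parameter values in the example are arbitrary.

```python
import math


def lower_tail(K, p, eps):
    """Exact P(mean of K i.i.d. Bernoulli(p) variables - p <= -eps)."""
    threshold = K * (p - eps)
    return sum(
        math.comb(K, s) * p**s * (1 - p) ** (K - s)
        for s in range(K + 1)
        if s <= threshold
    )


def hoeffding_bound(K, eps):
    """Upper bound exp(-2 K eps^2) for variables with values in [0, 1]."""
    return math.exp(-2 * K * eps**2)


# Compare the exact tail with the bound for a growing number of observations.
comparison = [(K, lower_tail(K, 0.5, 0.1), hoeffding_bound(K, 0.1)) for K in (10, 50, 200)]
```

Both columns decrease exponentially in `K`, which is why a wrong ordering of the slots by empirical estimates of $p_k$ happens only at a finite expected number of steps.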

Due to Assumption 4, the value of $r(a)$ is observed only if $p(k) = 1$ for the corresponding
slot. Further, when it is observed, its distribution does not depend on the slot. Then, we
define the arm-dedicated statistics $\hat\mu_{j,t}$ and $U_{j,t}$ from [2] which, in our case,
are based not on all the plays of the arm $j$ but only on all the observations of $r(a_j)$. The
first one, $\hat\mu_{j,t}$, estimates the expectation $\mu_{a_j}$ of $r(a_j)$:
$\hat\mu_{j,t} = \frac{1}{N_t^*(j)} \sum_{i=1}^{N_t^*(j)} r_i(a_j)$, where $N_t^*(j)$ is the
number of observations of $r(a_j)$ during the first $t$ steps and $r_i(a_j)$ is the $i$-th
observed value. The second statistic,
$U_{j,t} = g_{t,N_t^*(j)}\big(r_1(a_j), \ldots, r_{N_t^*(j)}(a_j)\big)$ (see the definition of
$g_{t,s}(Y_1, \ldots, Y_s)$ in Section IV of [2]; we define $Y_i$ as an observation of $r(a)$ for
some $a$), is a kind of an upper confidence bound used in different MAB algorithms, e.g., UCB-1
[?] and Bayesian-UCB [?], and is constructed to satisfy the asymptotic properties proved in
Theorem 4.2 in [2] under the following assumption.

Assumption 5. (i) The space of reward distributions can be parametrized by
$\theta \in \mathbb{R}$, i.e., $f(\cdot, k, a) = f(\cdot, \theta)$, in such a way that
$\log f(x, \theta)$ is concave in $\theta$ for each $x$. (ii)
$\int x^2 f(x, \theta)\,dx < \infty$.
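
The bookkeeping behind the two arm statistics can be sketched as follows. This Python fragment is a simplified illustration, not the construction of the paper: the sample mean is updated only at steps where the examination indicator equals one, and the index uses the UCB-1 style bonus $\sqrt{2 \log t / N_t^*(j)}$ as a stand-in for the function $g_{t,s}$, whose exact definition is not reproduced here. The class and method names are hypothetical.

```python
import math


class ArmStats:
    """Per-arm statistics based only on steps where r(a_j) was actually observed."""

    def __init__(self):
        self.n_obs = 0    # number of observations of r(a_j), i.e. N*_t(j)
        self.total = 0.0  # sum of the observed values of r(a_j)

    def update(self, examined, value):
        """Record one play; the value counts only if the slot was examined (p(k) = 1)."""
        if examined:
            self.n_obs += 1
            self.total += value

    def mean(self):
        """Sample mean of the observed values."""
        return self.total / self.n_obs

    def ucb(self, t):
        """UCB-1 style index (an assumed substitute for the statistic U_{j,t})."""
        return self.mean() + math.sqrt(2 * math.log(t) / self.n_obs)


stats = ArmStats()
for examined, value in [(True, 1.0), (False, 0.0), (True, 0.0)]:
    stats.update(examined, value)
```

After these three plays only two values are counted, so the estimate is not biased downwards by the unexamined play.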


Based on the statistics $\hat\mu_{j,t}$ and $U_{j,t}$, we describe an asymptotically optimal
algorithm under Assumptions 1, 2, 4 and 5 in Algorithm 1. Given the values of the statistics, it
chooses $m$ arms to be observed as the algorithm from [2] does and ranks them by decreasing
$\hat\mu_{j,t}$. Theorem 3 states its optimality.

Theorem 3. Algorithm 1 is asymptotically optimal under Assumptions 1, 2, 4 and 5, i.e., the
asymptotics of its regret coincides with the lower bound from Theorem 1:

$$\liminf_{t \to \infty} \frac{\mathrm{Reg}_t}{\log t} = \sum_{j > n} \frac{\mu_{a_m} - \mu_{a_j}}{I(a_j, a_m)},$$

where $I(a, b)$ denotes the Kullback-Leibler divergence between the distributions of $r(a)$ and
$r(b)$.


Proof. The following estimates show that, under Assumption 4, the latter asymptotics corresponds
to the lower bound: $I_k(a_j, a_i) = p_k I(a_j, a_i) \ge p_k I(a_j, a_m)$,

$$\frac{\mathrm{Reg}(k, j)}{p_k} = \frac{(\mu_{a_k} - \mu_{a_j})(p_k - p_{k+1}) + \ldots + (\mu_{a_{m-1}} - \mu_{a_j})(p_{m-1} - p_m) + (\mu_{a_m} - \mu_{a_j})\, p_m}{p_k} \ge \frac{(\mu_{a_m} - \mu_{a_j})\big((p_k - p_{k+1}) + \ldots + (p_{m-1} - p_m) + p_m\big)}{p_k} = \mu_{a_m} - \mu_{a_j}.$$

The further proof differs from that of Theorem 5.1 from [2] in two respects: (i) an optimal arm
list with a suboptimal order of arms provides the zero regret in the original case and a
non-zero one in our case; (ii) in our case, a play of an arm $j$ does not necessarily provide an
observation of $r(a_j)$. Our proof consists of the following steps corresponding to the steps
from [2].
Algorithm 1: Asymptotically optimal bandit algorithm under Assumptions 1, 2, 4 and 5

    Data: $m$, space $(A, \{f(\cdot, \bar a)\}_{\bar a \in A^m})$, arm parameters $\mathcal{A}$, slots $S$
 1  Make $m$ observations of $r_j$ (with $p(k) = 1$) for each arm $j$ (in any slots); $t_0 \leftarrow$ # of steps for it;
 2  Choose $\delta \in (0, p_m / (2N^2))$;
 3  for $t = t_0$ to $T$ do
 4      $j^* \leftarrow t \% N$;   // choose an arm uniformly over steps; x%y is a remainder of division of x by y
 5      $G \leftarrow \emptyset$; for $j = 1$ to $N$ do
 6          if $N_t^*(j) \ge \delta t$ then $G \leftarrow G \cup \{j\}$;
 7      if $|G| < m$ then Add different $(m - |G|)$ arms to $G$ randomly;
 8      $\pi'_t \leftarrow$ list of top-$m$ arms from $G$ by decreasing $\hat\mu_{j,t}$;
 9      if $j^* \in \pi'_t$ then
10          Show $\pi_t := \pi'_t$;
11      else
12          if $U_{j^*,t} < \hat\mu_{\pi'_t(m),t}$ then Show $\pi_t := \pi'_t$; else Show $\pi_t := (\pi'_t(1), \ldots, \pi'_t(m-1), j^*)$;
13      Observe user feedback $F(\bar a_{\pi_t})$;
14  Result: Arm list $\pi_t$ at each step
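
The list selection performed at each step (lines 4-12 of Algorithm 1) can be sketched in Python as below. This is a simplified illustration under stated assumptions rather than a full implementation: arms are indexed from 0, the statistics are passed in as plain lists (`n_obs`, `mean_hat`, `ucb` are hypothetical names for $N_t^*(j)$, $\hat\mu_{j,t}$ and $U_{j,t}$), and the initialization phase and the feedback update are omitted.

```python
import random


def choose_list(t, n_obs, mean_hat, ucb, m, delta, rng=random):
    """Return the ordered list of m arms to show at step t."""
    n_arms = len(n_obs)
    j_star = t % n_arms  # round-robin candidate arm for exploration
    # Well-sampled arms: those observed at least delta * t times.
    good = [j for j in range(n_arms) if n_obs[j] >= delta * t]
    if len(good) < m:
        # Pad with randomly chosen other arms so that m arms are available.
        rest = [j for j in range(n_arms) if j not in good]
        good += rng.sample(rest, m - len(good))
    # Leaders: top-m well-sampled arms, ranked by decreasing sample mean.
    leaders = sorted(good, key=lambda j: mean_hat[j], reverse=True)[:m]
    if j_star in leaders or ucb[j_star] < mean_hat[leaders[-1]]:
        return leaders
    # Otherwise explore the candidate arm in the last (least important) slot.
    return leaders[:-1] + [j_star]


n_obs = [10, 10, 10, 10]
mean_hat = [0.9, 0.6, 0.3, 0.1]
explore = choose_list(22, n_obs, mean_hat, [1.0, 0.8, 0.7, 0.2], m=2, delta=0.1)
exploit = choose_list(22, n_obs, mean_hat, [1.0, 0.8, 0.5, 0.2], m=2, delta=0.1)
```

In the first call the index of the candidate arm exceeds the sample mean of the weakest leader, so the candidate replaces it in the last slot; in the second call the leaders are shown unchanged. Exploring only in the last slot keeps the more important slots filled with the current best arms.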


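• Step A. For each relevant arm $i \le l$, we have $\mathbb{E} N_T(i, i) = T - o(\log T)$.

• Step B. $\mathbb{E} B_T = T - o(\log T)$, where
$B_T = \#\{t \le T \mid \pi'_t \text{ consists only of arms } j \le n\}$ and $\pi'_t$ is defined
in line 8 of Algorithm 1.

• Step C. For any $j > n$ and $\rho > 0$ there exists $\varepsilon > 0$ such that
$\mathbb{E} S_T(j) \le \frac{1 + \rho + o(1)}{I(a_j, a_m)} \log T$, where
$S_T(j) = \#\{t \le T \mid \pi'_t(i) = i\ \forall i \le l,\ |\hat\mu_{i,t} - \mu_{a_i}| < \varepsilon\ \forall i \le n,\ \pi'_t$
consists only of arms $i \le n$ and $r(a_j)$ is observed at step $t\}$.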


Now we explain how Steps A, B and C are combined to yield Theorem 3. Consider the following
particular case of the Chernoff-Hoeffding bound for independent observations of the indicator of
an $r(a_j)$ observation given that the arm $j$ is played at a particular step:

$$P\big(N_t^*(j) < N_t(j)\, p_m / 2\big) \le e^{-N_t(j)\, p_m^2 / 2}, \tag{17}$$

where $N_t(j)$ is the number of plays of the arm $j$ during the first $t$ steps. Along with the
condition $\delta < p_m / (2N^2)$, this estimate implies that both the expected number of steps
of Algorithm 1 with active line 7 and the expected number of steps in line 1 are finite.
Combined with Steps A and B, this provides that the cumulative regret of Algorithm 1 is of order
$o(\log T)$, except for the steps counted by $S_T(j)$, $j > n$, at Step C. The regret at these
steps is at most

$$\sum_{j > n} \frac{(\mu_{a_m} - \mu_{a_j})(1 + \rho + o(1))}{I(a_j, a_m)} \log T.$$

Letting $\rho \to 0$ concludes the proof.


□ 


Proof of Step A. Choose c > {N + 1){1 — 2N'^S /pm) ^ to provide [{B — B ^)/N] > 2N5B/pm 
for r G N and choose e < min{(pa, - p.aj)/‘2^,i < j < I', {Pai - Ma™)/2; (Po„ - Mo„+i)/2}. 
Lemmas 5.1 and 5.2 and their proofs can be transferred from [2] to our case without any changes.
Now we replace Lemma 5.3 from [2] by the following extended analysis. By Lemma 5.2, on $A_r B_r$,
any armi < I satishes Nt{i) > [(c’’ —c’'“^)/A^] > 2St/pm for any stepf G [B, andr > r* for 

somer*. Then, we estimate N*(i) by EauationflT] P{Nf{i) < St \ ArBr) < In combina¬ 
tion with Lemma 5.1, it implies P(Cr) = 1 — o(c~'") for Gr = i'^) — St}ArBr- Further, 

at step $t$, on $C_r$, each arm $i \le l$ is included in $\pi'_t$ in line 6 and, moreover,
$\pi'_t(i) = i$, i.e., it is played at its optimal position, since on $A_r$ Algorithm 1 sorts
$\pi'_t$ perfectly in line 8. Finally, we can estimate

S P(G,) = ES(minK+\f}-c’'-o(l))=f-o(lnf). □ 


Proof of Step B. Again, we transfer from [2] without changes Lemmas 5.1.B and 5.2.B and the
claim proved just after the proof of Lemma 5.2.B. Then, we apply the same trick as at Step A. □


Proof of Step C. Note that each observation of $r(a_j)$ counted by $S_T(j)$ occurs in the slot
$m$. Then, we transfer the proof of Step C from [2] to our case by replacing the notion of a play
of the arm $j$ with the notion of an observation of $r(a_j)$. □
Thus, we proved the tightness of the lower bound for the asymptotic behavior of the regret
provided by Lemma 1 and Theorem 1. Under Assumptions 2, 3 and 5, the construction of an optimal
algorithm reaching the lower bound from Theorem 2 could be similar. The specifics are that (i)
the agent should maintain the statistics $\hat\mu_{k,j,t}$ and $U_{k,j,t}$ for each pair of a
slot $k$ and an arm $j$ and (ii) if, at some step, the agent decides to substitute the arm from
the special arm-slot pair for the arm $j$ greedily chosen for this slot, it should find a
greedy-optimal combination of arms for the other slots again, since it may now include $j$.

5 Conclusion 

In this paper, we systematically studied the stochastic non-contextual multi-armed bandit problem 
with non-equivalent multiple plays. We considered some of the most interesting and, at the same 
time, quite general problem settings which are covered by our formulation and handle many applied 
problems. For them, we provided lower bounds for asymptotic behavior of the regret and proved 
tightness of these bounds. We believe that this work could be a basis both for finding theoretically 
optimal algorithms in more specific cases of our problem settings and for future development of 
applied algorithms. 